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Abstract 

When an item goes out of stock, sales transaction data no longer reflect the original cus¬ 
tomer demand, since some customers leave with no purchase while others substitute alternative 
products for the one that was out of stock. Here we develop a Bayesian hierarchical model for 
inferring the underlying customer arrival rate and choice model from sales transaction data and 
the corresponding stock levels. The model uses a nonhomogeneous Poisson process to allow 
the arrival rate to vary throughout the day, and allows for a variety of choice models. Model 
parameters are inferred using a stochastic gradient MCMC algorithm that can scale to large 
transaction databases. We fit the model to data from a local bakery and show that it is able 
to make accurate out-of-sample predictions, and to provide actionable insight into lost cookie 
sales. 


1 Introduction 

An important common challenge facing retailers is to understand customer preferences in the pres¬ 
ence of stockouts. When an item is out of stock, some customers will leave, while others will 
substitute a different product. From the transaction data collected by retailers, it is challenging 
to determine exactly what the customer’s original intent was, or, because of customers that leave 
without making a purchase, even how many customers there actually were. 

The task that we consider here is to infer both the customer arrival rate, including the unobserved 
customers that left without a purchase, and the substitution model, which describes how customers 
substitute when their preferred item is out of stock. Furthermore, we wish to infer these from sales 
transaction and stock level data, which data are readily available for many retailers. These quantities 
are a necessary input for inventory management and assortment planning problems. 

Stockouts are a common occurrence in some retail settings, such as bakeries and flash-sale retailers 
[5]. Not properly accounting for the data truncation caused by stockouts can lead to poor stocking 
decisions. Naively estimating demand as the number of items sold underestimates the demand of 
items that stock out, while overestimating the demand of their substitutes. This could lead the 
retailer to set the stock for the substitute items too high, while leaving the stock of the stocked-out 
item too low, potentially losing customers and revenue. 

There are several key features of our model and inference that make it successful in problem 
settings where prior work in the area has not been. First, prior work has assumed the arrival 
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rate to be constant within each time period |24j . Our model allows for arbitrary nonhomogeneous 
arrival rate functions, which is important for our bakery case study where sales have strong peaks 
at lunch time and between classes. Second, prior work has required a particular choice model 
[Dima, whereas our model can incorporate whichever choice model is most appropriate. There 
are a wide variety of choice models, econometric models describing how a customer chooses one of 
several alternatives, with different properties and which are applicable in different settings. Third, 
we model multiple customer segments, each with their own substitution models which can be used 
to borrow strength across data from multiple stores. Fourth, unlike prior work which has used point 
estimates, our inference is fully Bayesian. Because we do full posterior inference, we are able to 
compute the posterior predictive distributions for decision quantities of interest, such as lost sales 
due to stock unavailability. This allows us to incorporate the uncertainty in estimation directly into 
uncertainty in our decision quantities, thus leading to more robust decisions. 

Our contributions are four-fold. First, we develop a Bayesian hierarchical model that uses the 
censoring caused by stockouts and their induced substitutions to gain useful insight from transaction 
data. Our model is flexible and powerful enough to be useful in a wide range of retail settings. 
Second, we show how recent advances in MCMC for topic models can be adapted to our model 
to provide a sampling procedure that scales to large transaction databases. Third, we provide a 
simulation study which shows that we can recover the true generating values and which demonstrates 
the scalability of the inference procedure. Finally, we make available actual retail transaction data 
from a bakerjQ and use these data for a case study showing how the model and sampling work in 
a real setting. In the case study we evaluate the predictive power of the model, and show that our 
model can make accurate out-of-sample predictions whereas the baseline method cannot. We finally 
show how the methods developed here can be useful for decision making by producing a posterior 
predictive distribution of the bakery’s lost sales due to stock unavailability. 

2 The Generative Model 

We begin by introducing the notation that we use to describe the observed data. We then introduce 
the nonhomogeneous model for customer arrivals, followed by a discussion of various possible choice 
models. Section |2.4| discusses how multiple customer segments are modeled. Finally, Section |2.5| 
introduces the likelihood model and Section [2?6| discusses the prior distributions. 

2.1 The Data 

We suppose that we have data from a collection of stores cr = 1,..., S'. For each store, data come 
from a number of time periods I = 1,..., L'^, throughout each of which time varies from 0 to T. For 
example, in our experiments a time period was one day. We consider a collection of items i = 1,... ,n. 
We suppose that we have two types of data: purchase times and stock levels. We denote the number 
of purchases of item i in time period I at store a as Then, we let be 

the observed purchase times of item i in time period I at store a. For notational convenience, we 

{ cr / 1 ^ 

I be the collection of all purchase times for that store and time period, and let 

t = {^'’^’* 11 = 1 ,. .,L<' be the complete set of arrival time data. 

^ a=i,.’..,S 

We denote the known initial stock level as and assume that stocks are not replenished 
throughout the time period. That is, and equality implies a stockout. As before, we let 

and N represent respectively the collection of initial stock data for store a and time period I, 
and for all stores and all time periods. 

^Data are available at http://github.com/bletham/bakery 
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Given and we can compute a stock indicator as a function of time. We define this 

indicator function as 




I 0 if item i is out of stock at time t 
I 1 if item i is in stock at time t. 


The generative model for these data will be that customers arrive at the store according to some 
arrival process. Each customer belongs to a particular segment, and chooses an item to purchase 
(or no-purchase) based on the preferences of his or her segment and the available stock. When the 
customer purchases item i, the arrival time is recorded in . When a customer leaves without 
making a purchase, for instance because his or her preferred item is out of stock, the arrival time is 
not recorded. We now present the two main components of this model: the customer arrival process 
and the choice model. 


2.2 Modeling Customer Arrivals 


We model the times of customer arrivals using a nonhomogeneous Poisson process (NHPP). An 
NHPP is a generalization of the Poisson process that allows for the intensity to be described by 
a function A(t) > 0 as opposed to being constant. We assume that the intensity function has 
been parameterized, with parameters potentially different for each store a. The most basic 
parameterization is A(t | rj'^) = 77 ^, producing a homogeneous Poisson process of rate ? 7 f. As 
another example, we can produce an intensity function that rises to a peak and then decays by 
letting 





( 1 ) 


which is the derivative of the Hill equation . 

The posterior of rj'^ will be inferred. To do this we use the log-likelihood function for NHPP 
arrivals, which for arrival times ti,..., tm over interval [0, T] is: 


logp(ti, = ^log(A(tj)) - A(o,r), 

i=i 


where A(0, T) = X{t)dt. Our model can incorporate any integrable rate function. We let t] = 
represent the complete collection of rate function parameters to be inferred. 


2.3 Models for Substitution Behavior 

Whether or not a customer purchases an item and which item they purchase depends on the 
stock availability as well as some choice model parameters which we describe below. We define 
fi{s{t),cj )^to be the probability that a customer purchases product i given the current stock 
s{t) and choice model parameters 4>^ and r*. Then, we denote the no-purchase probability as 

n 

i=l 


The in dex k indicates the parameters for a particular customer segment, which we will discuss in 
Section 2.4 Posterior distributions for the parameters 0^ and are inferred. 


Choice models are econometric models describing a customer’s choice between several alterna¬ 
tives, often derived from a utility maximization problem. Different assumptions and utility models 
lead to different choice models, which ultimately lead to a different form of the purchase probability 
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fi{s{t), 4>^, T^). Our model accommodates any choice model for which the purchase probabilities can 
be expressed as a function of the current stock. We now discuss how several common choice models 
fit into this framework, and we use these choice models in our simulation and data experiments. 


2.3.1 Multinomial Logit Choice 

The multinomial logit (MNL) is a popular choice model with parameters specifying a 

preference distribution over products, that is, > 0 and — 1- Each customer selects a 

product according to that distribution. When an item goes out of stock, substitution takes place 
by transferring purchase probability to the other items proportionally to their original probability, 
including to the no-purchase option. This model requires a proportion /{I + t^) of arrivals be 
no-purchases when all items are in stock. The MNL choice probabilities are: 








( 2 ) 


The MNL model parameter is not identifiable when the arrival function is also unknown, a serious 
disadvantage of this model [23] . 


2.3.2 Single-Substitution Exogenous Model 

The exogenous choice model overcomes many of the shortcomings of the MNL model, including the 
issue of parameter identifiability. According to the exogenous proportional substitution model [13] . 
a customer samples a first choice from the preference distribution <p^. If that item is available, he or 
she purchases the item. If the first choice is not available, with probability 1 —the customer leaves 
as no-purchase. With the remaining probability, the customer picks a second choice according to a 
preference vector that has been re-weighted to exclude the first choice. Specifically, if the first choice 
was j then the probability of choosing i as the second choice is (l>v- If tEe second choice is 

in stock it is purchased, otherwise the customer leaves as no-purchase. The purchase probability is 

fr{s{t)A\r'^) = ( 3 ) 

j=l Vv 

Posterior distributions for both (p^ and are inferred. 

Allowing for the no-purchase option only in the event of stockouts means that the inferred arrival 
rate will be that of customers who actually would have purchased an item had all items been in 
stock. It would be possible for the exogenous model to include a proportion of customers that make 
no purchase even with full stock, as is required by the MNL model. However, inasmuch as these 
customers make no contribution to sales regardless of stock, it serves no purpose in the ultimate 
goal of understanding the effect of stock on sales. 


2.3.3 Nonparametric Choice Model 

Nonparametric models describe preferences as an ordered set of items. Let cf>^ be an ordered subset 
of the items {!,... ,n}. Customers purchase pi if it is in stock. If not, they purchase (t )2 if if is in 
stock. If not, they continue substituting down cp^ until they reach the first item that is available. 
If none of the items in cp^ are available, they leave as a no-purchase. The purchase probability for 
this model is then 1 for the first in-stock item in <p^, and 0 otherwise. 

Because this model requires all customers to behave exactly the same, it is most useful when 
customers are modeled as coming from a number of different segments k, each with its own preference 
ranking cp^. This is precisely what we do in our model, as we describe in the next section. For the 
nonparametric model the rank orders for each segment <p^ are fixed and it is the distribution of 
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customers across segments that is inferred. We do not generally need to consider all possible rank 
orders, as we discuss in the next section. 

2.4 Segments and Mixtures of Choice Models 

We model customers as each coming from one of K segments k = 1,..., K, each with its own choice 
model parameters and r*. Let O'^ be the customer segment distribution for store a, with 0^ the 
probability that an arrival at store a belongs to segment k, > 0, and = 1- with 

other variables, we denote the collection of segment distributions across all stores as 6. Similarly, 
we denote the collections of choice model parameters across all segments as 4> and x. 

For the nonparametric choice model, each of these segments would have a different rank ordering 
of items and multiple segments are required in order to have a diverse set of preferences. For the MNL 
and exogenous choice models, customer segments can be used to borrow strength across multiple 
stores. All stores share the same underlying segment parameters (f> and r, but each store’s arrivals 
are represented by a different mixing of these segments, 0'^. This model allows us to use data from 
all of the stores for inferring the choice model parameters, while still allowing stores to differ from 
each other by having a different mixture of segments. 

With the nonparametric choice model, using a segment for each ordered subset of {!,...,n} 
would likely result in more parameters than could be reasonably inferred for n even moderately 
large. Our inference procedure would be most appropriate for nonparametric models with one or 
two substitutions (that is, ordered subsets of size 2 or 3), which could still capture a wide range of 
behaviors. 

2.5 The Likelihood Model 

We now describe in detail the generative model for how customer segments, choice models, stock 
levels, and the arrival function all interact to create transaction data. Consider store cr and time 
period 1. Customers arrive according to the NHPP for this store. Let i'[’\ ..., represent all of the 
arrival times; these are unobserved, as they may include no-purchases. Each arrival has probability 
B’l of belonging to segment k. They then purchase an item or leave as no-purchase according to 
the choice model fi. If the j’th arrival purchases an item then we observe that purchase at time 
tj’*; if they leave as no-purchase we do not observe that arrival at all. The generative model for the 
observed data t is thus: 

For store cr = 1,..., S' and time period ? = 1,..., L'’’ : 

• Sample arrivals ..., ~ NHPP(A(t | rj'’’), T). 

• For arrival j = 1,..., 

— Sample segment as fc ~ Multinomial(0'’^). 

— With probability \ TV'’’’*), cf)^, r*) purchase item i, or no purchase with i = 0. 

— If item f > 0 purchased, add the time to 

We denote the probability that an arrival at time t purchases item i as 'Ki{t) = I 

TV'’^’*), T*'). An important quantity for the likelihood is the observed purchase rate, which is 

the arrival rate times the purchase probability: 

Xf{t) = x{t I vlMsit I r-*, Ar'’’*),c/.,x, 0 '’). (4) 

This is the rate at which customers purchase item i, incorporating stock availability and customer 
choice. The corresponding mean function is A^’*(0,r) = X'^'’’{t)dt. 

The following theorem gives the likelihood function corresponding to this generative model. 
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Theorem 1. The log-likelihood function oft is 


\ogp{t \T],9,(f),T,N,T) 

= ESE fs - Ar'(o,r) 

a=l ;=1 i=l \j=l 

Remarkably, the result is that which would be obtained if we treated the purchases for each item 
as independent NHPPs with rate Xf'\f), the observed purchase rate from Q. In reality, they are not 
independent NHPPs inasmuch as they depend on each other via the stock function s(t \ 

The key element of the proof is that while the purchase processes depend on each other, they do not 
depend on the no-purchase arrivals. The proof is given in the Appendix. Also in the Appendix we 
show how the mean function A^’*(0, T) can be expressed in terms of A(0, T \ 77 '^) and thus computed 
efficiently, provided the rate function is integrable. 

2.6 Prior Distributions 

Finally, we specify a prior distribution for each of the latent variables: 77, 6, and <p and t as required 
by the choice model. The variables 6, 4>, and r are all probability vectors, so the natural choice is 
to assign them a Dirichlet or Beta prior: 

0 ^ Dirichlet (a) 

0^ ^ Dirichlet(/3), k = 1,..., K 
^ Beta( 7 ), k = 1,... ,K. 

In our experiments, we used uniform priors by setting the hyperparameters to vectors of 1. Similarly, 
a natural choice for the prior distribution of 77 is a uniform distribution for each element: 

? 7 ^ ~ Uniform(<5’'), v = 1,... ,\r]'^\, a = l,...,S. 

In our experiments we chose the interval S'’ large enough to not be restrictive. 



3 MCMC Sampling 

We use MCMC techniques to simulate posterior samples, specifically the stochastic gradient Rieman- 
nian Langevin dynamics (SGRLD) algorithm of [T7j. SGRLD was developed for posterior inference 
in topic models, to which our model is conceptually similar. It uses a stochastic gradient that does 
not require the full likelihood function to be evaluated in every MCMC iteration, which is critical 
for doing posterior inference on a potentially very large transaction database. 

We first transform each of the probability variables using the expanded-mean parameterization 
m- Consider the latent variable 9, which has as constraints 6k > 0 and ~ 1- Take 9 

a random variable with support on K;^, and give 9 a prior distribution consisting of a product of 
Gamma(Q!fc,l) distributions: 

K 

p{6 I a) cx exp(-4)- 

The posterior sampling is done over variables 9 by mirroring any negative proposal values about 
0. We then set dk = 6k/ This parameterization is equivalent to sampling on 9 with a 

Dirichlet(Q:) prior, but does not require the probability simplex constraint. The same transformation 
is done to 4>^ and t^. 
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Let 2 : = {r7,0,</>,T} represent the complete collection of transformed latent variables whose 
posterior distribution we are inferring. From state on MCMC iteration ru, the next iteration 
moves to the state according to 


+ Y (diag( 2 ;„)V \ogp{Zy, \ t, a, /3, 7, S,N,T) + 1) 

+ diag(2:u,)^, 

Ip ~ A/'(0,eu,/). 


The iteration performs a gradient step plus normally distributed noise, using the natural gradient 
of the log posterior, which is the manifold direction of steepest descent using the metric G{z) = 
diag(z)“^. Using Bayes’ theorem, the posterior gradient can be decomposed into the likelihood 
gradient and the prior gradient, and we use a stochastic gradient approximation for the likelihood 
gradient. On MCMC iteration w, rather than use all time periods to compute the gradient we 
use a uniformly sampled collection of time periods /IJ)- The gradient approximation is then 


Vlogp(t I Zyj,N,T) 

s 


E E E ^ E >»8 {>■’■'{&) - Ar'to, T) 


(T—l 


lecz, i=i 




The iterations will converge to the posterior samples if the step size schedule is chosen such that 
^'Ad < 00 [25] ■ In our simulations and experiments we used three time 

periods for the stochastic gradient approximations. We followed m and took = a((l + q/b) 
with parameters a, 6, and c chosen using cross-validation over a grid to minimize out-of-sample 
perplexity. We drew 10,000 samples from each of three chains initialized at a local maximum a 
posteriori solution found from a random sample from the prior. We verified convergence using 
the Gelman-Rubin diagnostic after discarding the first half of the samples as burn-in [7], and then 
merged samples from all three chains to estimate the posterior. 


4 Simulation Study 

We use a collection of simulations to illustrate and analyze the model and the inference procedure. 
We use a variety of rate functions and choice models throughout the simulations to demonstrate 
this flexibility of the model. First we use the simulations to verify that the posterior concentrates 
around the true generating values for a wide selection of arrival rate functions, choice models, and 
model parameters. Then we use simulations to investigate the dependence on the amount of data 
used in the inference. The simulations show that the posterior variance decreases as the size of the 
training data set increases, which is remarkable inasmuch as the reduction of uncertainty came with 
no additional computational cost because of the stochastic gradient approximation for the likelihood. 

The first set of simulations used the homogeneous rate function A(t | ry'’’) = yyf and the exogenous 
choice model given in ([^, with S' = 3 stores, K = 2 segments, and n = 3 items. The choice model 
parameters were fixed at = 0.75, (p^ = [0.75,0.2,0.05], and = [0.33,0.33,0.34]. For 

each of 10 simulated data sets, the segment distributions were chosen independently at random 
from a uniform Dirichlet distribution and the arrival rates rj^ were chosen independently at random 
from a uniform distribution on [2,4]. For each store, we simulated 25 time periods, each of length 
T = 1000 and with the initial stock for each item chosen uniformly between 0 and 500, independently 
at random for each item, time period, and store. Purchase data were then generated according to 
the generative model in Section [27^ Figure [l] shows the posterior means estimated from the MCMC 
samples across the 10 repeats of the simulation, each with different segment distributions and rate 
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Figure 1: Markers in the top panel show, for each randomly chosen value of rji used in the set of 
simulations (3 stores x 10 simulations), the corresponding estimate of the posterior mean. The bot¬ 
tom panel shows the same result for each value of used (3 stores x 2 segments x 10 simulations). 
For a range of generating parameter values, the posterior distributions were centered on the true 
values. 


parameters. This figure shows that across the full range of parameter values used in these simulations 
the posterior mean was close to the true, generating value. 

In the second set of simulations we used the Hill rate function with the nonparametric choice 
model, with 3 items. We used all sets of preference rankings of size 1 and 2, which for 3 items 
requires a total of 9 segments. We simulated data for a single store, with the segment proportion 
set to 0.33 for preference rankings {1}, {1,2}, and {3,2}: The first segment prefers item 1 and will 
leave with no purchase if item 1 is not available, the second segment prefers item 1 but is willing to 
substitute to item 2, and the third segment prefers item 3 but is willing to substitute to item 2. The 
segment proportions for the remaining 6 preference rankings were set to zero. With this simulation 
we study the effect of the number of time periods used in the inference, L^. was taken from 
{5,10, 25, 50,100}, and for each of these values 10 simulations were done. 

As in Figure]^ the posterior densities for the segment proportions were concentrated near their 
true values. Figure]^ shows how the posteriors depended on the number of time periods of available 
data. The top panel shows that the posterior means for the non-zero segment proportions tended 
closer to the true value as more data were made available. The bottom panel shows the actual con¬ 
centration of the posterior, where the interquartile range of the posterior decreased with the number 
of time periods. Because we use a stochastic gradient approximation, using more time periods 
came at no additional computational cost: We used 3 time periods for each gradient approximation 
regardless of the available number. 
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Figure 2: Each marker corresponds to the posterior distribution for 0^ from a simulation with the 
corresponding number of time periods, across the 3 values of k where the true value equaled 0.33. 
The top panel shows the posterior mean for each of the simulations across the different number 
of time periods. The bottom panel shows the interquartile range (IQR) of the posterior. As the 
amount of available data increased, the posterior distributions became increasingly concentrated on 
the true values. 

5 Case Study: Bakery Sales 

We now provide the results of the model applied to real transaction data. As part of our case study, 
we evaluate the predictive power of the model and sample the posterior distribution of lost sales due 
to stockouts. 

We obtained one semester of sales data from the bakery at 100 Main Marketplace, a cafe located 
at MIT, for a collection of cookies: oatmeal, double chocolate, and chocolate chip. The data set 
included all purchase times for 151 days; we treated each day as a time period (11:00 a.m. to 7:00 
p.m.), and there were a total of 4084 purchases. Stock data were not available, only purchase times, 
so for the purpose of these experiments we set the initial stock for each time period equal to the 
number of purchases for the time period - thus every item was treated as stocked out after its last 
recorded purchase. This may be a reasonable assumption for these cookies given that they are 
perishable baked goods which are meant to stock out by the end of the day, but in any case the 
experiments still provide a useful illustration of the method. 

The empirical purchase rate for the cookies, shown in Figure]^ was markedly nonhomogeneous: 
there is a broad peak at lunch time and two sharp peaks at common class ending times. We modeled 
the rate function with a combination of the Hill function X^{t) Q and a fixed function consisting 
of only two peaks at the two afternoon peak times, AP(t), obtained via a spline. The Hill function 
has three parameters, and then a fourth parameter provided the weight of the fixed peaks that were 
added in: X{t \ t]) = X^{t \ r]i,r] 2 , Va )We fit the model separately with the exogenous and 
nonparametric choice models. 


9 






Figure 3: A normalized histogram of purchase times for the cookies, across time periods, along with 
posterior samples for the model’s corresponding predicted purchase rate. 



Figure 4: Normalized histogram of posterior samples of the exogenous choice model substitution rate, 
for the cookie data. This is the probability that a customer will substitute if his or her preferred 
item is out of stock. 


Figurej^shows 20 posterior samples for the model’s predicted average purchase rate over all time 
periods, which equals X]i=i ^ from the fit with the nonparametric choice model. These 

samples show that the model provides an accurate description of the arrival rate. The variance in 
the samples provides an indication of the uncertainty in the model, which further motivates the use 
of the posterior predictive distribution over a point estimate for making predictions. 

Figure shows the posterior density for the substitution rate r, obtained by fitting the model 
with the exogenous choice model. The substitution rate is very low, indicating that most customers 
left without a purchase if their preferred cookie was not in stock. The posterior distribution of the 
item preference vector is given in Chocolate chip cookies were the strong favorite, followed by 
double chocolate and lastly oatmeal. 

5.1 Predictive Performance 

The next set of experiments establish that the model has predictive power on real data. We evaluated 
the predictive power of the model by predicting out-of-sample purchase counts during periods of 
varying stock availability. We took the first 80% of time periods (120 time periods) as training data 
and did posterior inference. The latter 31 time periods were held out as test data, the goal being to 
use data from the first part of the semester to make predictions about the latter part. We considered 
each possible level of stock unavailability, ie., s = [1,0,0], s = [0,1,0], etc. For each stock level, we 
found all of the time intervals in the test periods with that stock. The prediction task was, given only 
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Figure 5: Normalized histograms of posterior samples of the exogenous choice model preference 
vector, for the cookie data. This vector provides the probability that each item is a customer’s 
primary choice. 


the time intervals and the corresponding stock level, to predict the total number of purchases that 
took place during those time intervals in the test periods. The actual number of purchases is known 
and thus predictive performance can be evaluated. There were no intervals where only chocolate 
chip cookies were out of stock, but predictions were made for every other stock combination. 

This is a meaningful prediction task because good performance requires being able to accurately 
model both the arrival rate as a function of time and how the actual purchases then depend on the 
stock. We compare predictive performance to a baseline model that has previously been proposed 
for this problem by [24) . which is the maximum likelihood model with a homogeneous arrival rate 
and the MNL choice model. We discuss this and other related works in more detail in Section HI 

For the MNL baseline, the parameter is unidentifiable and cannot be estimated. We fit the 
model for each fixed G {0.1, 0.2,..., 0.9}, and show here the results with the value of that 
minimized the out-of-sample absolute deviation between the model expected number of purchases 
and the true number of purchases, which was 0.4. That is, we show here the results that would 
have been obtained if we had known a priori the best value of r^, and thus show the best possible 
performance of the baseline. 

For our model, for each choice model (nonparametric and exogenous) posterior samples obtained 
from the MCMC procedure were used to estimate the posterior predictive distribution for the number 
of purchases under each stock level. For the maximum likelihood baseline, we used simulation 
to estimate the distribution of purchase counts conditioned on the point estimate model. These 
posterior densities, smoothed with a kernel density estimate, are given in Figure Despite their 
very different natures, the predictions made by the exogenous and nonparametric models are quite 
similar, and both have posterior means close to the true values for all stock levels. The baseline 
maximum likelihood model with a homogeneous arrival rate and MNL choice performs very poorly. 

5.2 Lost Sales Due to Stockouts 

Our purpose in inferring the model is to use it to make better stocking decisions. An important 
starting point is to use the inferred parameters to estimate what the sales would have been had there 
not been any stockouts. This allows us to know how much revenue is being lost with our current 
stocking strategy. We estimated posterior densities for the number of purchases of each item across 
151 time periods, with full stock. Figure [^compares those densities to the actual number of cookie 
purchases in the data. 

For each of the cookies, the actual number of purchases was significantly less than the posterior 
density for purchases with full stock, indicating that there were substantial lost sales due to stockouts. 
With the nonparametric model, the difference between the full-stock posterior mean and the actual 
number of purchases was 791 oatmeal cookies, 707 double chocolate cookies, and 1535 chocolate chip 
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Figure 6: Smoothed posterior densities for the number of purchases during test set intervals with 
the indicated stock availability for cookies [oatmeal, double chocolate, chocolate chip]. The density 
in blue is for the nonparametric choice, red is for the exogenous choice, and gray is for the baseline 
homogeneous arrival rate with MNL choice. The vertical line indicates the true value. 


cookies. Figure shows that customers were generally unwilling to substitute, which would have 
contributed to the lost sales. 


6 Related Work 

The primary work on this problem, estimating demand and substitution from sales transaction data 
with stockouts and unobserved no-purchases, was done by |24] . They model customer arrivals using a 
homogeneous Poisson process within each time period, meaning the arrival rate is constant through¬ 
out each time period. Customers then choose an item, or an unobserved no-purchase, according to 
the MNL choice model. They derive an EM algorithm to solve the corresponding maximum likeli¬ 
hood problem. In the prediction task of Section [O] we compared our results with this model as the 
baseline and found that it was unable to make accurate predictions with our case study data. Our 
model overcomes several limitations of this model, thereby substantially advancing the power of the 
inference and the settings in which the model can be used. First, Figure shows that the arrivals 
are significantly nonhomogeneous throughout the day, and modeling the arrival rate as constant 
throughout the day is likely the reason the baseline model failed the prediction task. The work in 
[24| proposes extending their model to a nonhomogeneous setting by choosing sufficiently small time 
periods that the arrival rate can be approximated as piecewise constant. However, with the level of 
nonhomogeneity seen in Figure it is implausible that accurate estimation could be done for the 
number of segments (and thus separate rate parameters) required to model the arrival rate with 
a piecewise-constant function. Second, our model does not require using the MNL choice model, 
which avoids the issue with the parameter t being unidentifiable. This parameter represents the 
proportion of arrivals that do not purchase anything even when all items are in stock, and is not 
something that a retailer would necessarily know. Finally, we take a Bayesian approach to inference 
and produce posterior predictive distributions. This becomes especially important in this setting 
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Purchases, double chocolate 



Purchases, chocolate chip 


Figure 7: Smoothed posterior densities for the number of purchases during all periods, had there 
had been no stockouts. The blue density is the result with the nonparametric choice model, and the 
red with the exogenous. The vertical line indicates the number of purchases in the data. 


where the parameters themselves are of secondary interest to using the model to make predictions 
about lost revenue and to make decisions about stocking strategies. 

Other work in this area includes [1], where customer arrivals are modeled with a homogeneous 
Poisson process and purchase probabilities are modeled explicitly for each stock combination, as 
opposed to using a choice model. Their model does not scale well to a large number of items 
as the likelihood expression includes all stock combinations found in the data. The work of [53] 
is extended in |22] to incorporate nonparametric choice models, for which maximum likelihood 
estimation becomes a large-scale concave program that must be solved via a mixed integer program 
subproblem. There is a large body of work on estimating demand and choice in settings different 
than that which we consider here, such as discrete time [511 na, panel or aggregate sales data 
[aiiiiii], negligible no purchases m, and online learning with simultaneous ordering decisions 
m- These models and estimation procedures do not apply to the setting that we consider here, 
which is retail transaction data with stockouts and unobserved no-purchases; m provide a review 
of the various threads of research in the larger field of demand and choice estimation. 

Our work fits into a growing body of work in advancing the use of statistics in areas of business. 
These areas include marketing [M [101121, market analysis mm, demand forecasting undo], and 
pricing mm- These works, and ours, address a real need for rigorous statistical methodologies in 
business, as well as a substantial opportunity for impact. 


7 Discussion 

We have developed a Bayesian model for inferring primary demand and consumer choice in the 
presence of stockouts. The model can incorporate a realistic model of the customer arrival rate, and 
is flexible enough to handle a variety of different choice models. Our model is conceptually related 
to topic models like latent Dirichlet allocation [3]. Variants of topic models are regularly applied 
to very large text corpora, with a large body of research on how to effectively infer these models. 
That research was the source of the stochastic gradient MCMC algorithm that we used, which allows 
inference from even very large transaction databases. 

The simulation study showed that when data were actually generated from the model, we were 
able to recover the true generating values. It further showed that the posterior bias and variance 
decreased as more data were made available, an improvement that came without any additional 
computational cost due to the stochastic gradient. 

In the case study we applied the model and inference to real sales transaction data from a local 
bakery. The daily purchase rate in the data was clearly nonhomogeneous, with several peak periods. 
These data clearly demonstrated the importance of modeling nonhomogeneous arrival rates in retail 
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settings. In a prediction task that required accurate modeling of both the arrival rate and the choice 
model, we showed that the model was able to make accurate predictions and significantly outperform 
the baseline approach. 

Finally, we showed how the model can be used to estimate lost sales due to stockouts. The 
posterior provided evidence of substantial lost cookie sales. The model and inference procedure we 
have developed provide a new level of power and flexibility that will aid decision makers in using 
transaction data to make smarter decisions. 
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Here we provide the derivation of the log-likelihood function. The proof will use two results 
which themselves are straightforward to show. 


Proposition 1. 
Proposition 2. 


pit] I = exp(-A(tj_i,tj I 'n‘^))\{tj I 


K{0,T\ig'^)=Y,Nl\0,T). 

z=0 
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of Theorem 1. We consider the complete arrivals which include both the observed arrivals 

<j ,l 

(7 I ( (T I '\ 

as well as the unobserved arrivals that left as no-purchase, which we here denote tp’ = j 

We define an indicator equal to i if the customer at time purchased item or 0 if this 
customer left as no-purchase. For store a and time period I, 


= P ^no arrivals in , T 




= exp(-A(t,^,..i,T I rj'")) 

X I r7'")exp(-A(0,ti’' | J7'"))7rj..i 
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We have then that 


IJ 7 ", T , iv , r ) 


<7,1 
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= J |r7^0^0,T,A^,^)dti 

cr, I 

™o 

exp(-Ao’'(0,T)) Ao’'(to;')dto’' 


i=i 


nexp(-Ar'(0,r)) n 


i=i 


= nexp(-A^’'(0,T)) ]J Xf’^iffj), 


Z=1 


since the last integrand is exactly the joint density for the arrivals from an NHPP with rate Ag’*(t), 
and so integrates to 1. 
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Given the model parameters, data are generated independently for each u and /, thus 


\ogp{t \'q,e,(j),T,N,T) 

S L” 

= log I 'n'", T, N, T) 


S 


= T,T,I 1 i:i»g(Ar'(G))-''r‘(o,r) 


<7 = 1 i=l i=l \i = i 


□ 

We now show how A'^’\o,T) can be expressed analytically in terms of A(0,T | For con¬ 

venience, in this section we suppress in the notation the dependence of the stock on past arrivals 
and initial stock levels and will write s{t \ as simply s{t). We consider each of the time 

intervals where the stock sit) is constant. Let the sequence of times qi’\... ,0^1,1 demarcate the 

intervals of constant stock. That is, [0, T] = ^r+i] 1® constant for t G [q^'^qr+i) 

for r = 1,..., — 1. Then, 

Af'(o,r) 

= rK’'m 

Jo 

T K 

k^l 

QCr,Z_2 / (j,l 

r=l k=l 

Q"’'-l / K \ 

r=l \fc=l / 

With this formula, the likelihood function can be computed for any parameterization \{t \ t]'^) 
desired so long as it is integrable. 
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